AITopics | scientific document

Collaborating Authors

scientific document

Information about AI from the News, Publications, and Conferences

Automatic Classification – Tagging and Summarization – Customizable Filtering and Analysis

If you are looking for an answer to the question What is Artificial Intelligence? and you only have a minute, then here's the definition the Association for the Advancement of Artificial Intelligence offers on its home page: "the scientific understanding of the mechanisms underlying thought and intelligent behavior and their embodiment in machines."

However, if you are fortunate enough to have more than a minute, then please get ready to embark upon an exciting journey exploring AI (but beware, it could last a lifetime) …

TeXpert: A Multi-Level Benchmark for Evaluating LaTeX Code Generation by LLMs

Kale, Sahil, Nadadur, Vijaykant

arXiv.org Artificial IntelligenceSep-16-2025

LaTeX's precision and flexibility in typesetting have made it the gold standard for the preparation of scientific documentation. Large Language Models (LLMs) present a promising opportunity for researchers to produce publication-ready material using LaTeX with natural language instructions, yet current benchmarks completely lack evaluation of this ability. By introducing TeXpert, our benchmark dataset with natural language prompts for generating LaTeX code focused on components of scientific documents across multiple difficulty levels, we conduct an in-depth analysis of LLM performance in this regard and identify frequent error types. Our evaluation across open and closed-source LLMs highlights multiple key findings: LLMs excelling on standard benchmarks perform poorly in LaTeX generation with a significant accuracy drop-off as the complexity of tasks increases; open-source models like DeepSeek v3 and DeepSeek Coder strongly rival closed-source counterparts in LaTeX tasks; and formatting and package errors are unexpectedly prevalent, suggesting a lack of diverse LaTeX examples in the training datasets of most LLMs. Our dataset, code, and model evaluations are available at https://github.com/knowledge-verse-ai/TeXpert.

large language model, machine learning, natural language, (19 more...)

arXiv.org Artificial Intelligence

doi: 10.18653/v1/2025.sdp-1.2

2506.1699

Country: North America > Mexico (0.28)

Genre: Research Report (0.82)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Add feedback

FLeW: Facet-Level and Adaptive Weighted Representation Learning of Scientific Documents

Dou, Zheng, Wang, Deqing, Zhuang, Fuzhen, Ren, Jian, Hu, Yanlin

arXiv.org Artificial IntelligenceSep-10-2025

Scientific document representation learning provides powerful embeddings for various tasks, while current methods face challenges across three approaches. 1) Contrastive training with citation-structural signals underutilizes citation information and still generates single-vector representations. 2) Fine-grained representation learning, which generates multiple vectors at the sentence or aspect level, requires costly integration and lacks domain generalization. 3) Task-aware learning depends on manually predefined task categorization, overlooking nuanced task distinctions and requiring extra training data for task-specific modules. To address these problems, we propose a new method that unifies the three approaches for better representations, namely FLeW. Specifically, we introduce a novel triplet sampling method that leverages citation intent and frequency to enhance citation-structural signals for training. Citation intents (background, method, result), aligned with the general structure of scientific writing, facilitate a domain-generalized facet partition for fine-grained representation learning. Then, we adopt a simple weight search to adaptively integrate three facet-level embeddings into a task-specific document embedding without task-aware fine-tuning. Experiments show the applicability and robustness of FLeW across multiple scientific tasks and fields, compared to prior models.

large language model, machine learning, natural language, (19 more...)

arXiv.org Artificial Intelligence

2509.07531

Country: Asia > China (0.15)

Genre: Research Report (0.64)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.68)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.68)

Add feedback

CEDe: A collection of expert-curated datasets with atom-level entity annotations for Optical Chemical Structure Recognition

Neural Information Processing SystemsAug-17-2025, 16:05:10 GMT

Chemical information in scientific literature is commonly presented in various ways, such as text, tables, charts, and images.

machine learning, natural language, pattern recognition, (18 more...)

Neural Information Processing Systems

Country:

North America > United States > California > Orange County > Irvine (0.04)
Europe > United Kingdom > England > West Midlands > Birmingham (0.04)
Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)
(2 more...)

Industry:

Law > Intellectual Property & Technology Law (0.69)
Health & Medicine > Pharmaceuticals & Biotechnology (0.47)

Technology:

Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Data Science (0.94)
Information Technology > Artificial Intelligence > Machine Learning > Pattern Recognition (0.93)
(2 more...)

Add feedback

Preface to the Special Issue of the TAL Journal on Scholarly Document Processing

Boudin, Florian, Aizawa, Akiko

arXiv.org Artificial IntelligenceJun-5-2025

The rapid growth of scholarly literature makes it increasingly difficult for researchers to keep up with new knowledge. Automated tools are now more essential than ever to help navigate and interpret this vast body of information. Scientific papers pose unique difficulties, with their complex language, specialized terminology, and diverse formats, requiring advanced methods to extract reliable and actionable insights. Large language models (LLMs) offer new opportunities, enabling tasks such as literature reviews, writing assistance, and interactive exploration of research. This special issue of the TAL journal highlights research addressing these challenges and, more broadly, research on natural language processing and information retrieval for scholarly and scientific documents.

computational linguistic, large language model, natural language, (12 more...)

arXiv.org Artificial Intelligence

2506.03587

Country:

Europe > France (0.30)
North America > United States > Florida > Miami-Dade County > Miami (0.15)
Europe > United Kingdom > Scotland (0.14)
Asia > Middle East > UAE (0.14)

Genre:

Overview (1.00)
Collection > Journal > Special Issue (0.72)
Research Report > Experimental Study (0.69)

Industry: Health & Medicine (0.47)

Technology: Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.75)

Add feedback

AdaParse: An Adaptive Parallel PDF Parsing and Resource Scaling Engine

Siebenschuh, Carlo, Hippe, Kyle, Gokdemir, Ozan, Brace, Alexander, Khan, Arham, Hossain, Khalid, Babuji, Yadu, Chia, Nicholas, Vishwanath, Venkatram, Stevens, Rick, Ramanathan, Arvind, Foster, Ian, Underwood, Robert

arXiv.org Artificial IntelligenceMay-6-2025

Language models for scientific tasks are trained on text from scientific publications, most distributed as PDFs that require parsing. PDF parsing approaches range from inexpensive heuristics (for simple documents) to computationally intensive ML-driven systems (for complex or degraded ones). The choice of the "best" parser for a particular document depends on its computational cost and the accuracy of its output. To address these issues, we introduce an Adaptive Parallel PDF Parsing and Resource Scaling Engine (AdaParse), a data-driven strategy for assigning an appropriate parser to each document. We enlist scientists to select preferred parser outputs and incorporate this information through direct preference optimization (DPO) into AdaParse, thereby aligning its selection process with human judgment. AdaParse then incorporates hardware requirements and predicted accuracy of each parser to orchestrate computational resources efficiently for large-scale parsing campaigns. We demonstrate that AdaParse, when compared to state-of-the-art parsers, improves throughput by $17\times$ while still achieving comparable accuracy (0.2 percent better) on a benchmark set of 1000 scientific documents. AdaParse's combination of high accuracy and parallel scalability makes it feasible to parse large-scale scientific document corpora to support the development of high-quality, trillion-token-scale text datasets. The implementation is available at https://github.com/7shoe/AdaParse/

artificial intelligence, machine learning, natural language, (17 more...)

arXiv.org Artificial Intelligence

2505.01435

Country: North America > United States > Illinois (0.28)

Genre: Research Report (1.00)

Industry:

Health & Medicine (0.68)
Information Technology > Security & Privacy (0.68)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Grammars & Parsing (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.46)

Add feedback

Can Large Language Models Serve as Effective Classifiers for Hierarchical Multi-Label Classification of Scientific Documents at Industrial Scale?

Tabatabaei, Seyed Amin, Fancher, Sarah, Parsons, Michael, Askari, Arian

arXiv.org Artificial IntelligenceDec-6-2024

We address the task of hierarchical multi-label classification (HMC) of scientific documents at an industrial scale, where hundreds of thousands of documents must be classified across thousands of dynamic labels. The rapid growth of scientific publications necessitates scalable and efficient methods for classification, further complicated by the evolving nature of taxonomies--where new categories are introduced, existing ones are merged, and outdated ones are deprecated. Traditional machine learning approaches, which require costly retraining with each taxonomy update, become impractical due to the high overhead of labelled data collection and model adaptation. Large Language Models (LLMs) have demonstrated great potential in complex tasks such as multi-label classification. However, applying them to large and dynamic taxonomies presents unique challenges as the vast number of labels can exceed LLMs' input limits. In this paper, we present novel methods that combine the strengths of LLMs with dense retrieval techniques to overcome these challenges. Our approach avoids retraining by leveraging zero-shot HMC for real-time label assignment. We evaluate the effectiveness of our methods on SSRN, a large repository of preprints spanning multiple disciplines, and demonstrate significant improvements in both classification accuracy and cost-efficiency. By developing a tailored evaluation framework for dynamic taxonomies and publicly releasing our code, this research provides critical insights into applying LLMs for document classification, where the number of classes corresponds to the number of nodes in a large taxonomy, at an industrial scale.

classification, large language model, machine learning, (21 more...)

arXiv.org Artificial Intelligence

2412.05137

Country:

North America > United States > New York > New York County > New York City (0.04)
North America > United States > New Mexico (0.04)
Europe > Netherlands > South Holland > Leiden (0.04)
Asia > Middle East > UAE > Abu Dhabi Emirate > Abu Dhabi (0.04)

Genre: Research Report > Promising Solution (0.34)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning (1.00)

Add feedback

StructFormer: Document Structure-based Masked Attention and its Impact on Language Model Pre-Training

Ponkshe, Kaustubh, Subramanian, Venkatapathy, Modani, Natwar, Ramakrishnan, Ganesh

arXiv.org Artificial IntelligenceNov-25-2024

Most state-of-the-art techniques for Language Models (LMs) today rely on transformer-based architectures and their ubiquitous attention mechanism. However, the exponential growth in computational requirements with longer input sequences confines Transformers to handling short passages. Recent efforts have aimed to address this limitation by introducing selective attention mechanisms, notably local and global attention. While sparse attention mechanisms, akin to full attention in being Turing-complete, have been theoretically established, their practical impact on pre-training remains unexplored. This study focuses on empirically assessing the influence of global attention on BERT pre-training. The primary steps involve creating an extensive corpus of structure-aware text through arXiv data, alongside a text-only counterpart. We carry out pre-training on these two datasets, investigate shifts in attention patterns, and assess their implications for downstream tasks. Our analysis underscores the significance of incorporating document structure into LM models, demonstrating their capacity to excel in more abstract tasks, such as document understanding.

document structure, machine learning, natural language, (18 more...)

arXiv.org Artificial Intelligence

2411.16618

Country:

South America > Chile > Santiago Metropolitan Region > Santiago Province > Santiago (0.05)
North America > United States (0.04)
Europe > Ireland > Leinster > County Dublin > Dublin (0.04)
(3 more...)

Genre: Research Report > Promising Solution (0.48)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Text Processing (0.93)

Add feedback

Efficient Few-shot Learning for Multi-label Classification of Scientific Documents with Many Classes

Schopf, Tim, Blatzheim, Alexander, Machner, Nektarios, Matthes, Florian

arXiv.org Artificial IntelligenceOct-21-2024

Scientific document classification is a critical task and often involves many classes. However, collecting human-labeled data for many classes is expensive and usually leads to label-scarce scenarios. Moreover, recent work has shown that sentence embedding model fine-tuning for few-shot classification is efficient, robust, and effective. In this work, we propose FusionSent (Fusion-based Sentence Embedding Fine-tuning), an efficient and prompt-free approach for few-shot classification of scientific documents with many classes. FusionSent uses available training examples and their respective label texts to contrastively fine-tune two different sentence embedding models. Afterward, the parameters of both fine-tuned models are fused to combine the complementary knowledge from the separate fine-tuning steps into a single model. Finally, the resulting sentence embedding model is frozen to embed the training instances, which are then used as input features to train a classification head. Our experiments show that FusionSent significantly outperforms strong baselines by an average of $6.0$ $F_{1}$ points across multiple scientific document classification datasets. In addition, we introduce a new dataset for multi-label classification of scientific documents, which contains 203,961 scientific articles and 130 classes from the arXiv category taxonomy. Code and data are available at https://github.com/sebischair/FusionSent.

classification, information retrieval, machine learning, (21 more...)

arXiv.org Artificial Intelligence

2410.0577

Country:

North America > United States > New York > New York County > New York City (0.05)
Asia > Singapore (0.04)
Asia > Middle East > UAE > Abu Dhabi Emirate > Abu Dhabi (0.04)
(16 more...)

Genre: Research Report (1.00)

Industry: Information Technology (0.46)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Inductive Learning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Information Retrieval (0.93)
Information Technology > Artificial Intelligence > Natural Language > Text Classification (0.87)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Regression (0.69)

Add feedback

Synthesizing Scientific Summaries: An Extractive and Abstractive Approach

Sharma, Grishma, Paretkar, Aditi, Sharma, Deepak

arXiv.org Artificial IntelligenceJul-29-2024

The availability of a vast array of research papers in any area of study, necessitates the need of automated summarisation systems that can present the key research conducted and their corresponding findings. Scientific paper summarisation is a challenging task for various reasons including token length limits in modern transformer models and corresponding memory and compute requirements for long text. A significant amount of work has been conducted in this area, with approaches that modify the attention mechanisms of existing transformer models and others that utilise discourse information to capture long range dependencies in research papers. In this paper, we propose a hybrid methodology for research paper summarisation which incorporates an extractive and abstractive approach. We use the extractive approach to capture the key findings of research, and pair it with the introduction of the paper which captures the motivation for research. We use two models based on unsupervised learning for the extraction stage and two transformer language models, resulting in four combinations for our hybrid approach. The performances of the models are evaluated on three metrics and we present our findings in this paper. We find that using certain combinations of hyper parameters, it is possible for automated summarisation systems to exceed the abstractiveness of summaries written by humans. Finally, we state our future scope of research in extending this methodology to summarisation of generalised long documents.

abstractiveness, case frame, summarization, (16 more...)

arXiv.org Artificial Intelligence

2407.19779

Country:

Asia > Japan > Honshū > Kantō > Tokyo Metropolis Prefecture > Tokyo (0.14)
North America > Dominican Republic (0.04)
Europe > Italy > Tuscany > Florence (0.04)
(4 more...)

Genre:

Overview (0.93)
Research Report > New Finding (0.34)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Text Processing (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.68)

Add feedback

Artificial Intuition: Efficient Classification of Scientific Abstracts

Sakhrani, Harsh, Pervez, Naseela, Kumar, Anirudh Ravi, Morstatter, Fred, Reed, Alexandra Graddy, Belz, Andrea

arXiv.org Artificial IntelligenceJul-8-2024

It is desirable to coarsely classify short scientific texts, such as grant or publication abstracts, for strategic insight or research portfolio management. These texts efficiently transmit dense information to experts possessing a rich body of knowledge to aid interpretation. Yet this task is remarkably difficult to automate because of brevity and the absence of context. To address this gap, we have developed a novel approach to generate and appropriately assign coarse domain-specific labels. We show that a Large Language Model (LLM) can provide metadata essential to the task, in a process akin to the augmentation of supplemental knowledge representing human intuition, and propose a workflow. As a pilot study, we use a corpus of award abstracts from the National Aeronautics and Space Administration (NASA). We develop new assessment tools in concert with established performance metrics.

classification, keyword, label space, (15 more...)

arXiv.org Artificial Intelligence

2407.06093

Country:

North America > United States > New York > New York County > New York City (0.05)
Asia > China > Hong Kong (0.04)
North America > United States > California > San Diego County > San Diego (0.04)
Europe > Ireland > Leinster > County Dublin > Dublin (0.04)

Genre: Research Report (0.69)

Industry:

Government > Space Agency (0.69)
Energy > Renewable (0.68)
Government > Regional Government > North America Government > United States Government (0.55)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Information Retrieval (0.94)

Add feedback